Part I - Exploring the Ford GoBike System Dataset¶

by Abdulmali Alajmi¶

Introduction¶

The Ford GoBike System Dataset is a valuable resource for understanding the usage patterns and trends of a bike-sharing system. It contains data related to the Ford GoBike system, which is a bike-sharing service that provides a convenient and sustainable transportation option for people in various communities. The dataset includes information on the number of trips taken, trip distances, trip durations, and user demographics, among other things. Analyzing this data can help decision-makers understand how the system is being used, identify areas for improvement, and make data-driven decisions to meet the needs of users. Whether you are a researcher, a transportation planner, or simply someone interested in the use of bike-sharing systems, the Ford GoBike System Dataset provides a wealth of information for exploring and understanding this important transportation option.

Preliminary Wrangling¶

In [143]:
# import all packages and set plots to be embedded inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from haversine import Unit
import haversine as hs 
import plotly.io as pio
pio.renderers.default = "notebook_connected"
import plotly.offline as ofl
ofl.init_notebook_mode()
import plotly.graph_objects as go




%matplotlib inline

Load in your dataset and describe its properties through the questions below. Try and motivate your exploration goals through this section.

In [144]:
# read data file
df = pd.read_csv('201902-fordgobike-tripdata.csv')
In [145]:
df
Out[145]:
duration_sec start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_birth_year member_gender bike_share_for_all_trip
0 52185 2019-02-28 17:32:10.1450 2019-03-01 08:01:55.9750 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13.0 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer 1984.0 Male No
1 42521 2019-02-28 18:53:21.7890 2019-03-01 06:42:03.0560 23.0 The Embarcadero at Steuart St 37.791464 -122.391034 81.0 Berry St at 4th St 37.775880 -122.393170 2535 Customer NaN NaN No
2 61854 2019-02-28 12:13:13.2180 2019-03-01 05:24:08.1460 86.0 Market St at Dolores St 37.769305 -122.426826 3.0 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer 1972.0 Male No
3 36490 2019-02-28 17:54:26.0100 2019-03-01 04:02:36.8420 375.0 Grove St at Masonic Ave 37.774836 -122.446546 70.0 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber 1989.0 Other No
4 1585 2019-02-28 23:54:18.5490 2019-03-01 00:20:44.0740 7.0 Frank H Ogawa Plaza 37.804562 -122.271738 222.0 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber 1974.0 Male Yes
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
183407 480 2019-02-01 00:04:49.7240 2019-02-01 00:12:50.0340 27.0 Beale St at Harrison St 37.788059 -122.391865 324.0 Union Square (Powell St at Post St) 37.788300 -122.408531 4832 Subscriber 1996.0 Male No
183408 313 2019-02-01 00:05:34.7440 2019-02-01 00:10:48.5020 21.0 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 66.0 3rd St at Townsend St 37.778742 -122.392741 4960 Subscriber 1984.0 Male No
183409 141 2019-02-01 00:06:05.5490 2019-02-01 00:08:27.2200 278.0 The Alameda at Bush St 37.331932 -121.904888 277.0 Morrison Ave at Julian St 37.333658 -121.908586 3824 Subscriber 1990.0 Male Yes
183410 139 2019-02-01 00:05:34.3600 2019-02-01 00:07:54.2870 220.0 San Pablo Ave at MLK Jr Way 37.811351 -122.273422 216.0 San Pablo Ave at 27th St 37.817827 -122.275698 5095 Subscriber 1988.0 Male No
183411 271 2019-02-01 00:00:20.6360 2019-02-01 00:04:52.0580 24.0 Spear St at Folsom St 37.789677 -122.390428 37.0 2nd St at Folsom St 37.785000 -122.395936 1057 Subscriber 1989.0 Male No

183412 rows × 16 columns

In [146]:
# display data types and non-null counts
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 183412 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype  
---  ------                   --------------   -----  
 0   duration_sec             183412 non-null  int64  
 1   start_time               183412 non-null  object 
 2   end_time                 183412 non-null  object 
 3   start_station_id         183215 non-null  float64
 4   start_station_name       183215 non-null  object 
 5   start_station_latitude   183412 non-null  float64
 6   start_station_longitude  183412 non-null  float64
 7   end_station_id           183215 non-null  float64
 8   end_station_name         183215 non-null  object 
 9   end_station_latitude     183412 non-null  float64
 10  end_station_longitude    183412 non-null  float64
 11  bike_id                  183412 non-null  int64  
 12  user_type                183412 non-null  object 
 13  member_birth_year        175147 non-null  float64
 14  member_gender            175147 non-null  object 
 15  bike_share_for_all_trip  183412 non-null  object 
dtypes: float64(7), int64(2), object(7)
memory usage: 22.4+ MB
In [147]:
df.shape
Out[147]:
(183412, 16)
In [148]:
# display the mean, std, and an overview of the data
df.describe()
Out[148]:
duration_sec start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year
count 183412.000000 183215.000000 183412.000000 183412.000000 183215.000000 183412.000000 183412.000000 183412.000000 175147.000000
mean 726.078435 138.590427 37.771223 -122.352664 136.249123 37.771427 -122.352250 4472.906375 1984.806437
std 1794.389780 111.778864 0.099581 0.117097 111.515131 0.099490 0.116673 1664.383394 10.116689
min 61.000000 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000
25% 325.000000 47.000000 37.770083 -122.412408 44.000000 37.770407 -122.411726 3777.000000 1980.000000
50% 514.000000 104.000000 37.780760 -122.398285 100.000000 37.781010 -122.398279 4958.000000 1987.000000
75% 796.000000 239.000000 37.797280 -122.286533 235.000000 37.797320 -122.288045 5502.000000 1992.000000
max 85444.000000 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000

What is the structure of your dataset?¶

The dataset contains 183,412 rows and 16 columns. Some columns have incorrect data types: (1) start_time and end_time should be datetime, (2) start_station_id, end_station_id, and member_birth_year should be int.

What is/are the main feature(s) of interest in your dataset?¶

The most important feature, I think, is user_type; I would like to look for the reasons that lead a person to subscribe.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?¶

I will create a new column called distance to investigate whether it affects user_type. Other supporting columns include age and gender.
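The planned distance column can be sketched with a standalone haversine implementation (the notebook below uses the `haversine` package instead); this is a minimal sketch, using the start and end station coordinates of the first trip in the raw data as a toy check.

```python
import numpy as np

def haversine_m(lat1, lon1, lat2, lon2, radius_m=6_371_000):
    """Great-circle distance in meters between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_m * np.arcsin(np.sqrt(a))

# Start/end coordinates of the first trip in the table above
d = haversine_m(37.789625, -122.400811, 37.794231, -122.402923)  # ≈ 545 m
```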

In [149]:
# count how many null values are in each column
df.isna().sum()
Out[149]:
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
In [150]:
# duplication check
df.duplicated().sum()
Out[150]:
0
In [151]:
# drop null values
df.dropna(inplace=True)
In [152]:
# Change data types to correct types

df['start_time'] = pd.to_datetime(df['start_time'])
df['end_time'] = pd.to_datetime(df['end_time'])
df['start_station_id'] = df['start_station_id'].astype('int')
df['end_station_id'] = df['end_station_id'].astype('int')
df['member_birth_year'] = df['member_birth_year'].astype('int')

Add new columns¶

In [153]:
# Ref ==> https://towardsdatascience.com/calculating-distance-between-two-geolocations-in-python-26ad3afe287b
def distance(obs):
    loc1=(obs['start_station_latitude'], obs['start_station_longitude'])
    loc2=(obs['end_station_latitude'], obs['end_station_longitude'])
    return hs.haversine(loc1,loc2,unit=Unit.METERS)

df['distance'] = df.apply(distance,axis=1)
In [154]:
# convert duration_sec to minutes and drop duration_sec
df['duration_min'] = df['duration_sec'] / 60
df.drop('duration_sec',axis=1,inplace=True)
In [155]:
# create a new column called age
df['age']=2019 - df['member_birth_year']
In [156]:
df.describe()
Out[156]:
start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id member_birth_year distance duration_min age
count 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000
mean 139.002126 37.771220 -122.351760 136.604486 37.771414 -122.351335 4482.587555 1984.803135 1690.051442 11.733379 34.196865
std 111.648819 0.100391 0.117732 111.335635 0.100295 0.117294 1659.195937 10.118731 1096.958237 27.370082 10.118731
min 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 1878.000000 0.000000 1.016667 18.000000
25% 47.000000 37.770407 -122.411901 44.000000 37.770407 -122.411647 3799.000000 1980.000000 910.444462 5.383333 27.000000
50% 104.000000 37.780760 -122.398279 101.000000 37.781010 -122.397437 4960.000000 1987.000000 1429.831313 8.500000 32.000000
75% 239.000000 37.797320 -122.283093 238.000000 37.797673 -122.286533 5505.000000 1992.000000 2224.012981 13.150000 39.000000
max 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 2001.000000 69469.336637 1409.133333 141.000000
In [157]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 18 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   start_time               174952 non-null  datetime64[ns]
 1   end_time                 174952 non-null  datetime64[ns]
 2   start_station_id         174952 non-null  int32         
 3   start_station_name       174952 non-null  object        
 4   start_station_latitude   174952 non-null  float64       
 5   start_station_longitude  174952 non-null  float64       
 6   end_station_id           174952 non-null  int32         
 7   end_station_name         174952 non-null  object        
 8   end_station_latitude     174952 non-null  float64       
 9   end_station_longitude    174952 non-null  float64       
 10  bike_id                  174952 non-null  int64         
 11  user_type                174952 non-null  object        
 12  member_birth_year        174952 non-null  int32         
 13  member_gender            174952 non-null  object        
 14  bike_share_for_all_trip  174952 non-null  object        
 15  distance                 174952 non-null  float64       
 16  duration_min             174952 non-null  float64       
 17  age                      174952 non-null  int32         
dtypes: datetime64[ns](2), float64(6), int32(4), int64(1), object(5)
memory usage: 22.7+ MB
In [158]:
df.drop('member_birth_year',axis=1, inplace = True)
In [159]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   start_time               174952 non-null  datetime64[ns]
 1   end_time                 174952 non-null  datetime64[ns]
 2   start_station_id         174952 non-null  int32         
 3   start_station_name       174952 non-null  object        
 4   start_station_latitude   174952 non-null  float64       
 5   start_station_longitude  174952 non-null  float64       
 6   end_station_id           174952 non-null  int32         
 7   end_station_name         174952 non-null  object        
 8   end_station_latitude     174952 non-null  float64       
 9   end_station_longitude    174952 non-null  float64       
 10  bike_id                  174952 non-null  int64         
 11  user_type                174952 non-null  object        
 12  member_gender            174952 non-null  object        
 13  bike_share_for_all_trip  174952 non-null  object        
 14  distance                 174952 non-null  float64       
 15  duration_min             174952 non-null  float64       
 16  age                      174952 non-null  int32         
dtypes: datetime64[ns](2), float64(6), int32(3), int64(1), object(5)
memory usage: 22.0+ MB
In [160]:
df.describe()
Out[160]:
start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id distance duration_min age
count 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000 174952.000000
mean 139.002126 37.771220 -122.351760 136.604486 37.771414 -122.351335 4482.587555 1690.051442 11.733379 34.196865
std 111.648819 0.100391 0.117732 111.335635 0.100295 0.117294 1659.195937 1096.958237 27.370082 10.118731
min 3.000000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 0.000000 1.016667 18.000000
25% 47.000000 37.770407 -122.411901 44.000000 37.770407 -122.411647 3799.000000 910.444462 5.383333 27.000000
50% 104.000000 37.780760 -122.398279 101.000000 37.781010 -122.397437 4960.000000 1429.831313 8.500000 32.000000
75% 239.000000 37.797320 -122.283093 238.000000 37.797673 -122.286533 5505.000000 2224.012981 13.150000 39.000000
max 398.000000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 69469.336637 1409.133333 141.000000
In [161]:
# create new columns for the day of week and hour
df['day'] = df['start_time'].dt.weekday
df['hour'] = df['start_time'].dt.hour
# pandas dt.weekday: 0 = Monday ... 6 = Sunday
day_dic = {0: 'Monday', 1: 'Tuesday', 2: 'Wednesday', 3: 'Thursday', 4: 'Friday', 5: 'Saturday', 6: 'Sunday'}
df['day'] = df['day'].map(day_dic)
In [162]:
df.head(5)
Out[162]:
start_time end_time start_station_id start_station_name start_station_latitude start_station_longitude end_station_id end_station_name end_station_latitude end_station_longitude bike_id user_type member_gender bike_share_for_all_trip distance duration_min age day hour
0 2019-02-28 17:32:10.145 2019-03-01 08:01:55.975 21 Montgomery St BART Station (Market St at 2nd St) 37.789625 -122.400811 13 Commercial St at Montgomery St 37.794231 -122.402923 4902 Customer Male No 544.709256 869.750000 35 Thursday 17
2 2019-02-28 12:13:13.218 2019-03-01 05:24:08.146 86 Market St at Dolores St 37.769305 -122.426826 3 Powell St BART Station (Market St at 4th St) 37.786375 -122.404904 5905 Customer Male No 2704.548867 1030.900000 47 Thursday 12
3 2019-02-28 17:54:26.010 2019-03-01 04:02:36.842 375 Grove St at Masonic Ave 37.774836 -122.446546 70 Central Ave at Fell St 37.773311 -122.444293 6638 Subscriber Other No 260.738904 608.166667 30 Thursday 17
4 2019-02-28 23:54:18.549 2019-03-01 00:20:44.074 7 Frank H Ogawa Plaza 37.804562 -122.271738 222 10th Ave at E 15th St 37.792714 -122.248780 4898 Subscriber Male Yes 2409.304744 26.416667 45 Thursday 23
5 2019-02-28 23:49:58.632 2019-03-01 00:19:51.760 93 4th St at Mission Bay Blvd S 37.770407 -122.391198 323 Broadway at Kearny 37.798014 -122.405950 5200 Subscriber Male No 3332.207230 29.883333 60 Thursday 23
In [163]:
# remove rows where gender is 'Other' (copy to avoid SettingWithCopyWarning)
df = df[df['member_gender'] != 'Other'].copy()
In [164]:
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 171305 entries, 0 to 183411
Data columns (total 19 columns):
 #   Column                   Non-Null Count   Dtype         
---  ------                   --------------   -----         
 0   start_time               171305 non-null  datetime64[ns]
 1   end_time                 171305 non-null  datetime64[ns]
 2   start_station_id         171305 non-null  int32         
 3   start_station_name       171305 non-null  object        
 4   start_station_latitude   171305 non-null  float64       
 5   start_station_longitude  171305 non-null  float64       
 6   end_station_id           171305 non-null  int32         
 7   end_station_name         171305 non-null  object        
 8   end_station_latitude     171305 non-null  float64       
 9   end_station_longitude    171305 non-null  float64       
 10  bike_id                  171305 non-null  int64         
 11  user_type                171305 non-null  object        
 12  member_gender            171305 non-null  object        
 13  bike_share_for_all_trip  171305 non-null  object        
 14  distance                 171305 non-null  float64       
 15  duration_min             171305 non-null  float64       
 16  age                      171305 non-null  int32         
 17  day                      171305 non-null  object        
 18  hour                     171305 non-null  int64         
dtypes: datetime64[ns](2), float64(6), int32(3), int64(2), object(6)
memory usage: 24.2+ MB
In [165]:
# create a new column that shows the age group
def age_group(age):
    if 10 < age <= 20:
        return '11-20'
    elif 20 < age <= 30:
        return '21-30'
    elif 30 < age <= 40:
        return '31-40'
    elif 40 < age <= 50:
        return '41-50'
    elif 50 < age <= 60:
        return '51-60'
    else:
        return '>60'

df['age_group'] = df['age'].map(age_group)

In [166]:
df.describe()
Out[166]:
start_station_id start_station_latitude start_station_longitude end_station_id end_station_latitude end_station_longitude bike_id distance duration_min age hour
count 171305.00000 171305.000000 171305.000000 171305.000000 171305.000000 171305.000000 171305.000000 171305.000000 171305.000000 171305.000000 171305.000000
mean 138.70695 37.770629 -122.351657 136.304889 37.770831 -122.351225 4481.294136 1687.777172 11.629300 34.160649 13.451545
std 111.71479 0.101225 0.118522 111.421147 0.101130 0.118088 1659.524197 1094.411591 26.287562 10.116083 4.733722
min 3.00000 37.317298 -122.453704 3.000000 37.317298 -122.453704 11.000000 0.000000 1.016667 18.000000 0.000000
25% 47.00000 37.770083 -122.411901 44.000000 37.770407 -122.411647 3796.000000 908.552902 5.366667 27.000000 9.000000
50% 104.00000 37.780760 -122.398279 100.000000 37.781010 -122.397437 4960.000000 1428.327735 8.483333 32.000000 14.000000
75% 239.00000 37.797280 -122.283127 237.000000 37.797320 -122.287610 5505.000000 2217.417808 13.116667 39.000000 17.000000
max 398.00000 37.880222 -121.874119 398.000000 37.880222 -121.874119 6645.000000 69469.336637 1409.133333 141.000000 23.000000
In [167]:
days_ordered = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
df['days_ordered'] = pd.Categorical(df['day'], categories=days_ordered, ordered=True)
age_ord = ['11-20', '21-30', '31-40', '41-50', '51-60', '>60']
df['age_group_ord'] = pd.Categorical(df['age_group'], categories=age_ord, ordered=True)

In [168]:
# export clean data to CSV
df.to_csv("201902-fordgobike-tripdata-clean.csv", index=False)

Univariate Exploration¶

In this section, investigate distributions of individual variables. If you see unusual points or outliers, take a deeper look to clean things up and prepare yourself to look at relationships between variables.

1) User Type¶

In [169]:
plt.figure(figsize=[14,7])
fig=sns.barplot(x=df['user_type'].value_counts().index,y=df['user_type'].value_counts(),palette='mako')
fig.set(ylabel='Frequency',xlabel='User Type',title='Distribution of User Type');

As we can see above, most users are subscribers.

2) Gender¶

In [170]:
plt.figure(figsize=[14,7])
fig=sns.barplot(x=df['member_gender'].value_counts().index,y=df['member_gender'].value_counts(),palette='mako')
fig.set(ylabel='Frequency',xlabel='Gender',title='Distribution of gender');

As we can see above, most users are male.

3) Trips each day¶

In [171]:
df['day'].value_counts()
Out[171]:
Thursday     32984
Tuesday      30022
Wednesday    27825
Friday       27083
Monday       25106
Sunday       14183
Saturday     14102
Name: day, dtype: int64
In [172]:
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df, x='days_ordered', palette='mako')

fig.set(xlabel='Day', ylabel='Frequency', title='Number of trips each day');

As we can see above, Saturday and Sunday see far fewer trips than the weekdays.

4) Number of trips each Hour¶

In [173]:
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df, x='hour', palette='mako')
fig.set(xlabel='Hour', ylabel='Frequency',title='Number of trips each Hour' );

As we can see above, usage peaks during the commute windows of 7:00-9:00 AM and 4:00-6:00 PM.
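The hourly pattern above comes from the `hour` column built with `Series.dt.hour`; a minimal sketch on toy timestamps (made up for illustration, not the GoBike data):

```python
import pandas as pd

# Toy timestamps (hypothetical, not the real data)
times = pd.to_datetime(pd.Series([
    "2019-02-01 08:12:00", "2019-02-01 08:45:00",
    "2019-02-01 17:30:00", "2019-02-01 12:05:00",
]))
hour_counts = times.dt.hour.value_counts()  # the 8 o'clock hour appears twice
```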

5) Distribution of Age¶

In [174]:
plt.figure(figsize=[14,7])

fig = sns.histplot(data=df, x="age", kde=True, binwidth=2)
plt.title('Distribution of Age');

In [175]:
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df, x='age_group_ord', palette='mako')

fig.set(xlabel='Age group', ylabel='Frequency', title='Age Group Frequency');

As we can see above, most users are younger than 41.

In [176]:
plt.figure(figsize=[14,7])
fig =  sns.boxplot(y='age',data=df ,palette='mako')
fig.set( ylabel='User Age',title='Box Plot of User Age' );

As we can see above, ages above roughly 60 appear as outliers in the box plot, but I believe most of them are genuine: it is quite possible for a 70-year-old to use this service. (The maximum age of 141, however, clearly comes from an invalid birth year.)
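Whether the 60+ ages count as outliers can be checked with the 1.5 × IQR rule that box-plot whiskers use; a sketch on toy ages (assumed values, not the real data). A value flagged by the rule may still be a genuine rider, which is the judgment call made above.

```python
import pandas as pd

ages = pd.Series([22, 25, 27, 30, 31, 33, 35, 39, 45, 52, 70])
q1, q3 = ages.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr  # whisker bounds
outliers = ages[(ages < lower) | (ages > upper)]  # only 70 falls outside
```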

6) Trip distance¶

In [177]:
plt.figure(figsize=[14,7])
fig =  sns.boxplot(y='distance',data=df ,palette='mako')
fig.set( ylabel='distance (M)',title='Trip distance' );

As we can see above, there are outliers in the distance column.

In [178]:
plt.figure(figsize=[14,7])

plt.hist(df.loc[(df['distance'] > 0) & (df['distance'] < 2000), 'distance'], edgecolor='black', linewidth=1)
plt.title('Distribution of distance');

As we can see above, most trip distances fall between 750 and 1,500 meters.

7) Frequency of Bike Share for All Trip¶

In [179]:
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['bike_share_for_all_trip'].value_counts().index,y= df['bike_share_for_all_trip'].value_counts(),palette='mako')
fig.set(ylabel='Frequency', title='Frequency of Bike Share for All Trip')

plt.show()

As we can see above, most trips do not use the Bike Share for All program.

8) Distribution of Trip Duration¶

In [180]:
plt.figure(figsize=[14,7])
fig =  sns.boxplot(y='duration_min',data=df ,palette='mako')
fig.set(ylabel='Duration (min)', title='Box Plot of Trip Duration');
In [181]:
plt.figure(figsize=[14,7])

fig = sns.histplot(data=df[df['duration_min'] < 100], x="duration_min", kde=True, binwidth=2)
plt.title('Distribution of Trip Duration');

As we can see above, most trip durations are under 20 minutes.

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?¶

The gender column contained an 'Other' value, which I dropped. User type showed no unusual points. Most trip distances fall between 750 and 1,500 meters. The unusual finding is that most trips happen on weekdays rather than weekends, concentrated around the commute hours (roughly 8-9 AM and 5-7 PM).

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?¶

I found outliers in a couple of features:

  • Age: outliers from about 60 to 80 (plus an impossible maximum of 141)
  • Distance: outliers at 0 m or above roughly 3,000 m
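These distance outliers could be trimmed with a boolean mask, keeping only trips strictly longer than 0 m and at most 3,000 m; a sketch on toy distances in meters (made-up values, not the real data):

```python
import pandas as pd

trips = pd.DataFrame({"distance": [0.0, 120.0, 950.0, 1430.0, 2800.0, 69469.0]})
kept = trips[(trips["distance"] > 0) & (trips["distance"] <= 3000)]
# the zero-length trip and the 69 km trip are dropped
```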

Bivariate Exploration¶

In this section, investigate relationships between pairs of variables in your data. Make sure the variables that you cover here have been introduced in some fashion in the previous section (univariate exploration).

9) Distance By Day of Week¶

In [182]:
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['days_ordered'], y=df['distance'], palette='mako')

fig.set(xlabel='Day', ylabel='Mean distance (m)', title='Distance By Day of Week');

As we can see above, there is no clear relationship between the day of the week and trip distance.

10) Distance By Hour¶

In [183]:
plt.figure(figsize=[14,7])
fig = sns.barplot(x=df['hour'], y=df['distance'], palette='mako')

fig.set(xlabel='Hour', ylabel='Mean distance (m)', title='Distance By Hour');

As we can see above, trips starting between 5:00 and 9:00 AM tend to cover greater distances.

11) Duration and Distance By Age Group¶

In [184]:
plt.figure(figsize=[14,7])
fig = sns.barplot(data=df, x='age_group_ord', y='duration_min', palette='mako')

fig.set(xlabel='Age group', ylabel='Mean duration (min)', title='Duration By Age Group');

As we can see above, members older than 60 have the longest average trip durations.

In [185]:
plt.figure(figsize=[14,7])
fig = sns.barplot(data=df, x='age_group_ord', y='distance', palette='mako')

fig.set(xlabel='Age group', ylabel='Mean distance (m)', title='Distance By Age Group');

12) User Type By Gender¶

In [186]:
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df,x='user_type',hue='member_gender',palette='mako')

fig.set(xlabel='User type', ylabel='Frequency', title='User Type By Gender');

As we can see above, most subscribers are male.

13) User Type By Age Group¶

In [187]:
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df,x='user_type',hue='age_group',palette='mako')

fig.set(xlabel='User type', ylabel='Frequency', title='User Type By Age Group');

As we can see above, subscribers and customers show the same pattern across age groups.

14) Day By User Type¶

In [188]:
plt.figure(figsize=[14,7])
fig = sns.countplot(data=df, x='days_ordered', hue='user_type', palette='mako')

fig.set(xlabel='Day', ylabel='Frequency', title='Day By User Type');

As we can see above, most subscriber trips happen on weekdays.

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?¶

Looking at user types, I found that customers ride relatively more on weekends, while subscribers mostly ride on weekdays. I also found that most subscribers are male.
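The weekday/weekend split by user type can be quantified with a row-normalized crosstab; a toy sketch (made-up trips, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 4,
    "is_weekend": [False, False, False, True, True, True, True, False],
})
# normalize="index": share of each user type's trips on weekends vs weekdays
share = pd.crosstab(toy["user_type"], toy["is_weekend"], normalize="index")
```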

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?¶

Most subscribers are male, and they use the service mainly on weekdays.

Multivariate Exploration¶

Create plots of three or more variables to investigate your data even further. Make sure that your investigations are justified, and follow from your work in the previous sections.

15) Correlation between Duration, Distance, and Age Group¶

In [189]:
plt.figure(figsize=[14,7])
sns.scatterplot(data=df[df['distance'] < 2000], x="duration_min", y="distance", hue="age_group")
plt.title("Correlation between Duration, Distance, and Age Group")
plt.ylabel('Distance (m)')
plt.xlabel('Duration (min)')
plt.show()

As we can see above, duration and distance are positively correlated across age groups; the 21-30 group is the most prominent among the longer, farther trips.

16) Correlation between Duration, Distance, and User Type¶

In [190]:
plt.figure(figsize=[14,7])
sns.scatterplot(data=df[df['distance'] < 2000], x="duration_min", y="distance", hue="user_type")
plt.title("Correlation between Duration, Distance, and User Type", size=18)
plt.ylabel('Distance (m)')
plt.xlabel('Duration (min)')
plt.show()

As we can see above, the duration-distance pattern looks much the same for both user types.

In [191]:
plt.figure(figsize=[14,7])

sns.heatmap(df.corr(numeric_only=True), annot=True, linewidth=.6);

As we can see in the plot above, the location variables (station ids, latitudes, and longitudes) are strongly correlated with each other, which is natural. Otherwise there is no strong or even moderate correlation between the variables; most coefficients are below 0.10.
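The pattern in the heatmap can be reproduced on a toy frame; `numeric_only=True` (available in recent pandas) excludes the object columns, which also silences the FutureWarning this notebook's environment raised. All values here are synthetic:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
start_lat = rng.uniform(37.3, 37.9, 200)
toy = pd.DataFrame({
    "start_lat": start_lat,
    "end_lat": start_lat + rng.normal(0, 0.01, 200),  # nearly identical -> strong correlation
    "duration_min": rng.uniform(1, 60, 200),          # independent of location
    "user_type": ["Subscriber"] * 200,                # non-numeric, excluded from corr
})
corr = toy.corr(numeric_only=True)
```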

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?¶

I found a positive correlation between trip duration and distance, and the 21-30 age group stands out among the longer trips.

Were there any interesting or surprising interactions between features?¶

I expected user type to correlate with distance and duration, with subscribers taking longer trips, but the visualization shows no such relationship.

Conclusions¶

Most trips on the bike-sharing system cover 750-1,500 meters. There is no clear relationship between the day of the week and the distance traveled, and most trips do not use the Bike Share for All program. Weekends see far fewer trips than weekdays, and the large majority of users are subscribers. Members aged 60 or older have the longest average trip durations. The busiest times are the commute windows of 7am-9am and 4pm-6pm, with Thursday the busiest day of the week. The median trip duration is about 8.5 minutes, and the median trip distance is about 1.4 km.

In [ ]: